0.1 Report Set-up

Load all required packages for this report and define some global customization settings.

# for knitting the .rmd file to .html
if (! requireNamespace("knitr", quietly = TRUE)) {
  install.packages("knitr")
}
if (! requireNamespace("kableExtra", quietly = TRUE)) {
  install.packages("kableExtra")
}

# for creating models and analyzing expression data
if (! requireNamespace("edgeR", quietly = TRUE)) {
  BiocManager::install("edgeR")
}
if (! requireNamespace("limma", quietly = TRUE)) {
  BiocManager::install("limma")
}
if (! requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
if (! requireNamespace("gprofiler2", quietly = TRUE)) {
  install.packages("gprofiler2")
}

# for creating plots
if (! requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
if (! requireNamespace("VennDiagram", quietly = TRUE)) {
  install.packages("VennDiagram")
}
if (! requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  BiocManager::install("ComplexHeatmap")
}
if (! requireNamespace("circlize", quietly = TRUE)) {
  BiocManager::install("circlize")
}
if (! requireNamespace("fgsea", quietly = TRUE)) {
  BiocManager::install("fgsea")
}
if (! requireNamespace("qusage", quietly = TRUE)) {
  BiocManager::install("qusage")
}
if (! requireNamespace("clusterProfiler", quietly = TRUE)) {
  BiocManager::install("clusterProfiler")
}
if (! requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  BiocManager::install("org.Hs.eg.db")
}

library(clusterProfiler)
library(org.Hs.eg.db)
library(edgeR)
library(dplyr)
library(ggplot2)
library(fgsea)
library(qusage)
library(VennDiagram)

# wraps lines that are too long
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=80), tidy=TRUE)
# set default behaviors for all chunks
knitr::opts_chunk$set(warning = FALSE, message = FALSE)

All table and figure outputs in this report are rendered using the knitr package [1], and kableExtra package [2] for styling. Some codes in this report are adapted from Lecture 10 - GSEA.


1 Introduction

1.1 Background on the Data Set

Glucocorticoids (GC) is a class of steroid hormones that plays a role in regulating the immune system and certain aspects of the immune function, such as inflammation. The most common naturally-produced GC hormone in humans is cortisol, which are produced by the adrenal glands. Due to GC’s potent anti-inflammatory effects, several synthetic forms of GC are used as immunosuppressive drugs to treat various medical conditions such as asthma, allergies, autoimmune disorders, and cancer. [3]

The data set used in this report comes from Quatrini et al. (2022)’s paper: Glucocorticoids inhibit human hematopoietic stem cell differentiation toward a common ILC precursor [3], where they analyzed the the role of GC on regulating innate lymphoid cells (ILCs) differentiation, including cytotoxic natural killer cells and helper ILCs, from human hematopoietic stem cells (HSCs). The RNA-seq data set used in this report is a part of the overall study; it evaluates the effect of the presence of Dexamethasone (DEX), a glucocorticoid medication, on gene expressions of HSCs. The raw data set is downloaded from the NCBI Gene Expression Omnibus, with ID GSE186950. [4]

1.2 Recap of Assignment #1 and #2

In Assignment 1, I performed initial pre-processing and normalization of the data set. The raw data set contains 2 groups, control and conditioned (DEX), with 3 samples in each group: AF25, AF26, AF29. The data set initially contained gene expressions of 58683 genes, but 46243 genes were removed due to low counts with less than 1 count per million in at least 3 samples, leaving 12440 genes.

In the original data set, genes are labeled in a mix of HUGO gene symbols, GenBank accession IDs, EMBL identifiers, etc. Using the biomaRt package [5], we attempted to map non-HUGO identifiers to their corresponding HUGO gene symbols, but had to discard 847 (6.8%) genes with no matches, leaving a final set of 11593 unique genes.

knitr::include_graphics("figures/A2/raw_fltr_data_cpm.png")

Figure 1. log2 CPM Distributions of gene expressions of each sample in the data set, after filtering out genes with low-counts or without HUGO gene symbol match, and before TMM normalization. This figure is adapted from Assignment 1 with slight aesthetic modifications.

As shown in Fig.1, The filtered data set is approximately normally distributed with no outlier samples, and therefore we normalized the data using the Trimmed Mean of M-values (TMM) method, via the edgeR package. [6]

In Assignment 2, using the normalized data set, we built two models to identify the set of significantly differentially expressed genes. Then, we selected one of the models (a treatment-only model built using the exact test of the edgeR package) and performed thresholded over-representation analysis on genes that are significantly differentially expressed, based on a threshold of FDR-correct p-value < 0.05.

In this assignment, we’ll start with the result from the treatment-only exact test model and load it as an .rds file. The top_et.rds file can be generated by running saveRDS(top_et, 'top_et.rds') after running all code chunks in A2_JunNi_Du.Rmd.

top_et <- readRDS("top_et.rds")
et_output <- top_et$table

# calculate rank for each gene
et_output[, "rank"] <- -log(et_output$PValue, base = 10) * sign(et_output$logFC)
et_output <- et_output[order(et_output$rank), ]

2 Non-thresholded Gene set Enrichment Analysis

rnk <- data.frame(rownames(et_output), et_output$rank)
colnames(rnk) <- c("GeneName", "rank")
write.table(rnk, file = "./dex_ctrl.rnk", sep = "\t", col.name = TRUE, row.names = FALSE,
    quote = FALSE)
gsea_upreg <- read.table(file = "gsea_report_for_na_pos_1680460999801.tsv", sep = "\t",
    header = TRUE, fill = TRUE)
gsea_downreg <- read.table(file = "gsea_report_for_na_neg_1680460999801.tsv", sep = "\t",
    header = TRUE, fill = TRUE)

2.1 Non-thresholded Gene set Enrichment Analysis Questions

1. What method did you use? What genesets did you use? Make sure to specify versions and cite your methods.

2. Summarize your enrichment results.

The GSEA pre-ranked identified 2590 up-regulated gene sets out of 5263 total gene sets, with 327 gene sets having nominal p-value < 0.05 and 1 gene set having FDR q-value < 0.25, which indicates significant enrichment.

For down-regulated gene sets, 2673 down-regulated gene sets out of 5263 total gene sets were identified, with 345 gene sets having nominal p-value < 0.05 and 62 gene set having FDR q-value < 0.25, which indicates significant enrichment.

pathway_names <- sapply(stringr::str_split(gsea_upreg$NAME, "%"), "[", 1:2)
pathway_names <- t(pathway_names)
gsea_upreg <- cbind(pathway_names, gsea_upreg)

upreg <- gsea_upreg[, c("1", "2", "SIZE", "NES", "NOM.p.val", "FDR.q.val")]
colnames(upreg) <- c("Term", "Data Source", "Size", "Normalized Enrichment Score",
    "Nominal p-value", "FDR q-value")
kableExtra::kable_paper(knitr::kable(head(upreg[which(upreg$"Nominal p-value" < 0.05),
    ]), format = "html", digits = 10))
Term Data Source Size Normalized Enrichment Score Nominal p-value FDR q-value
NEUTROPHIL DEGRANULATION REACTOME DATABASE ID RELEASE 83 418 2.066326 0 0.3181236
OXIDATIVE PHOSPHORYLATION GOBP 93 2.066148 0 0.1602628
RESPIRATORY ELECTRON TRANSPORT, ATP SYNTHESIS BY CHEMIOSMOTIC COUPLING, AND HEAT PRODUCTION BY UNCOUPLING PROTEINS. REACTOME 117 2.007220 0 0.4715452
RESPIRATORY ELECTRON TRANSPORT REACTOME DATABASE ID RELEASE 83 97 1.996420 0 0.4438916
HALLMARK_OXIDATIVE_PHOSPHORYLATION MSIGDBHALLMARK 149 1.993352 0 0.3759924
THE CITRIC ACID (TCA) CYCLE AND RESPIRATORY ELECTRON TRANSPORT REACTOME 161 1.968398 0 0.4876496
pathway_names <- sapply(stringr::str_split(gsea_downreg$NAME, "%"), "[", 1:2)
pathway_names <- t(pathway_names)
gsea_downreg <- cbind(pathway_names, gsea_downreg)

downreg <- gsea_downreg[, c("1", "2", "SIZE", "NES", "NOM.p.val", "FDR.q.val")]
colnames(downreg) <- c("Term", "Data Source", "Size", "Normalized Enrichment Score",
    "Nominal p-value", "FDR q-value")
kableExtra::kable_paper(knitr::kable(head(downreg[which(downreg$"Nominal p-value" <
    0.05), ]), format = "html", digits = 10))
Term Data Source Size Normalized Enrichment Score Nominal p-value FDR q-value
HALLMARK_INTERFERON_ALPHA_RESPONSE MSIGDBHALLMARK 71 -2.064144 0.000000000 0.000000000
T CELL ACTIVATION GOBP 121 -1.970077 0.000000000 0.009179046
INTERFERON ALPHA BETA SIGNALING REACTOME 50 -1.921399 0.001811594 0.033516850
HALLMARK_INTERFERON_GAMMA_RESPONSE MSIGDBHALLMARK 144 -1.895172 0.000000000 0.057725385
POSITIVE REGULATION OF SMALL GTPASE MEDIATED SIGNAL TRANSDUCTION GOBP 39 -1.847791 0.000000000 0.175786400
LYMPHOCYTE ACTIVATION GOBP 201 -1.846445 0.000000000 0.150864050

3. How do these results compare to the results from the thresholded analysis in Assignment #2. Compare qualitatively. Is this a straight forward comparison? Why or why not?

summary_A2 <- readRDS("summary_A2.rds")

upreg_comparison <- data.frame(summary_A2$`top_terms_up-reg`, tolower(upreg$Term[1:8]))
downreg_comparison <- data.frame(summary_A2$`top_terms_down-reg`, tolower(downreg$Term[1:8]))
colnames(upreg_comparison) <- c("A#2 ORA Top Terms (Up-Regulated)", "A#3 GSEA Top Terms (Up-Regulated)")
colnames(downreg_comparison) <- c("A#2 ORA Top Terms (Down-Regulated)", "A#3 GSEA Top Terms (Down-Regulated)")
kableExtra::kable_paper(knitr::kable(upreg_comparison, format = "html", digits = 10))
A#2 ORA Top Terms (Up-Regulated) A#3 GSEA Top Terms (Up-Regulated)
inflammatory response neutrophil degranulation
cell migration oxidative phosphorylation
Metabolic pathways respiratory electron transport, atp synthesis by chemiosmotic coupling, and heat production by uncoupling proteins.
Amino sugar and nucleotide sugar metabolism respiratory electron transport
Neutrophil degranulation hallmark_oxidative_phosphorylation
Innate Immune System the citric acid (tca) cycle and respiratory electron transport
Nuclear receptors meta-pathway respiratory electron transport chain
Glucocorticoid receptor pathway atp synthesis coupled electron transport
kableExtra::kable_paper(knitr::kable(downreg_comparison, format = "html", digits = 10))
A#2 ORA Top Terms (Down-Regulated) A#3 GSEA Top Terms (Down-Regulated)
signaling hallmark_interferon_alpha_response
cell communication t cell activation
Hematopoietic cell lineage interferon alpha beta signaling
Asthma hallmark_interferon_gamma_response
Cytokine Signaling in Immune system positive regulation of small gtpase mediated signal transduction
Signal Transduction lymphocyte activation
Pathogenesis of SARS-CoV-2 mediated by nsp9-nsp10 complex antigen processing and presentation of exogenous antigen
T-cell receptor signaling pathway antigen processing and presentation

3 Cytoscape Visualization

3.1 Cytoscape Visualization Questions

1. Create an enrichment map - how many nodes and how many edges in the resulting map? What thresholds were used to create this map? Make sure to record all thresholds. Include a screenshot of your network prior to manual layout.

The enrichment map was created using the following thresholds:

  • FDR q-value cutoff 0.25
  • p-value cutoff 0.05
  • gene-set similarity filtering cutoff 0.375 (with default metric: Jaccard+Overlap combined)

The enrichment map has 63 nodes and 71 edges.

knitr::include_graphics("figures/A3/GseaPreranked.png")

2. Annotate your network - what parameters did you use to annotate the network. If you are using the default parameters make sure to list them as well.

Using the AutoAnnotate App available in Cytoscape with the following parameters:

  • Max number of annotations: 10
  • Cluster Options: Use clusterMaker App
    • Cluster algorithm: MCL Cluster
    • Edge weight column: None
  • Label Column: Name
  • Label Algorithm: WordCloud: Adjacent Words (default)
  • Max words per label: 3
  • Minimum word occurrence: 1
  • Adjacent word bonus: 8

3. Make a publication ready figure - include this figure with proper legends in your notebook.

The enrichment map below was created using the parameters above, arranged used GoSE layout and spatially-rearranged to be more compact.

knitr::include_graphics("figures/A3/EM_grouped.png")

4. Collapse your network to a theme network. What are the major themes present in this analysis? Do they fit with the model? Are there any novel pathways or themes?

knitr::include_graphics("figures/A3/theme.png")

Per the above figure, some of the major themes include immune system (e.g.lymphocyte mediated immunity, antigen receptor-mediated signaling pathway, regulation of interferon-beta production, defense response to virus, etc.) and cell signaling (e.g. signal transduction protein, nervous system axonogenesis, activation involved response). All the gene sets in these themes are down-regulated, which fits with our models where the presence of dexamethasone downregulates immune system responses and cell signaling, which is the mechanism for its anti-inflammatory property.

Two novel themes that are shown in the annotated clusters and the theme network, but not in the top hits of the models, are ‘p53 class mediator’ and ‘sequestering calcium ion’. Additional discussion regarding these findings can be found in question 2 of Interpretation and Detailed View of Results.

4 Interpretation and Detailed View of Results

Background and pre-processing of the data set can be found here:

Questions from previous sections can be found here:

1. Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment #2 thresholded methods

2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result?

Two novel themes that are shown in the annotated clusters and the theme network, but not in the top hits of the models, are ‘p53 class mediator’ and ‘sequestering calcium ion’. p53 is a tumor suppressor protein that regulates cell division. In a paper by Sengupta et al. (2000), it is said that dexamethasone-activated endogenous and exogenous glucocorticoid receptor inhibit p53-dependent , so it makes sense to see this theme down-regulated in our network.

Using your networks and results from the previous section add one of the following:

I chose to answer question 3, regarding dark matter.

4.1 Dark Matter

3. Sometimes the most interesting information is the gene that has no information. In this type of pathway analysis we can only discover what we have already described previously in the literature or pathway databases. Often pathways found in one disease are applicable to other diseases so this technique can be very helpful. It is important to highlight any genes that are significantly differentially expressed in your model but are not annotated to any pathways. We refer to this set of genes as the dark matter.

# extract out the sets of genes required for the Venn diagram.
gene_sets <- fgsea::gmtPathways("Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt")
Annotated <- unique(unlist(gene_sets))
Enriched <- unique(unlist(gene_sets[c(gsea_upreg$NAME, gsea_downreg$NAME)]))
genes_expressed <- readRDS("raw_fltr_matrix.rds")
Expressed <- unique(rownames(genes_expressed))
venn_diag <- draw.triple.venn(area1 = length(Annotated), area2 = length(Enriched),
    area3 = length(Expressed), n12 = length(intersect(Annotated, Enriched)), n13 = length(intersect(Annotated,
        Expressed)), n23 = length(intersect(Enriched, Expressed)), n123 = length(intersect(Annotated,
        intersect(Enriched, Expressed))), category = c("Annotated", "Enriched", "Expressed"),
    fill = hcl.colors(3, palette = "Sunset"), fontfamily = "Arial", cat.fontfamily = "Arial",
    cat.cex = 1.2, )
grid::grid.draw(venn_diag)

Per the Venn diagram, 9479 out of 11593 filtered expressed genes (81.76%) were annotated and enriched. This is ideal as it is an indication that the outputs of our enrichment analyses is a good representation of the expression data. Here, we’ll focus on the 2114 genes that were not enriched, which contains 1620 expressed genes that are not enriched nor annotated, and 494 genes that are annotated but not enriched.

# set of genes that are annotated but not enriched
anno_no_enrich <- setdiff(intersect(Expressed, Annotated), Enriched)
anno_no_enrich <- et_output[anno_no_enrich, ]
# filter out genes that are not significantly differentially expressed
anno_no_enrich <- anno_no_enrich[which(anno_no_enrich$FDR < 0.05), ]

# set of genes with no annotation
no_anno <- setdiff(Expressed, Annotated)
no_anno <- et_output[no_anno, ]
# filter out genes that are not significantly differentially expressed
no_anno <- no_anno[which(no_anno$FDR < 0.05), ]
rmarkdown::paged_table(anno_no_enrich[order(anno_no_enrich$rank, decreasing = TRUE),
    c("FDR", "rank")])
rmarkdown::paged_table(no_anno[order(no_anno$rank, decreasing = TRUE), c("FDR", "rank")])
# creating normalized CPM for heatmap
dge <- edgeR::DGEList(counts = genes_expressed, group = c("CTRL", "CTRL", "CTRL",
    "DEX", "DEX", "DEX"))

# calculate normalization factor
dge <- calcNormFactors(dge)
# get the normalized data
normalized_cpm <- cpm(dge)

3.i. Include a heatmap of any significant genes that are not annotated to any of the pathways returned in the enrichment analysis.

# creating heatmap matrix
heatmap_matrix_1 <- t(scale(t(normalized_cpm[rownames(anno_no_enrich), ])))
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_1), 0, max(heatmap_matrix_1)),
    c("blue", "white", "red"))

# creating treament annotation
treat_colours <- c("orange", "darkblue")
names(treat_colours) <- c("DEX", "CTRL")
treatment_col <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(treatment = rep(c("CTRL",
    "DEX"), c(3, 3))), col = list(treatment = treat_colours))

# Create the heatmap
heatmap_anno_no_enrich <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_1), cluster_rows = TRUE,
    cluster_columns = TRUE, show_row_dend = TRUE, show_column_dend = TRUE, col = heatmap_col,
    show_column_names = TRUE, show_row_names = FALSE, show_heatmap_legend = TRUE,
    top_annotation = treatment_col, column_title = "Differential Gene Expressions of Genes That Are Annotated but Not Enriched",
    name = "row-normalized \ngene expression")
heatmap_anno_no_enrich

As we can see from the heatmap, there are some distinction of expression pattern between the control and conditioned group. This distinction is not definite (e.g. on the bottom left, we can see a few areas where genes are up-regulated in two of the three samples but down-regulated in the other). Although the distinction does exist, this heat map only contains 28 genes (significantly differentially expressed, annotated, but not enriched), and is unlikely to contain any major significant pathways that we did not already identify.

3.ii. Include a heatmap of any significant genes that are not annotated to any pathways in entire set of pathways used for the analysis.

# creating heatmap matrix
heatmap_matrix_2 <- t(scale(t(normalized_cpm[rownames(no_anno), ])))
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_2), 0, max(heatmap_matrix_2)),
    c("blue", "white", "red"))

# creating treatment annotation
treat_colours <- c("orange", "darkblue")
names(treat_colours) <- c("DEX", "CTRL")
treatment_col <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(treatment = rep(c("CTRL",
    "DEX"), c(3, 3))), col = list(treatment = treat_colours))

# Create the heatmap
heatmap_no_anno <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_2), cluster_rows = TRUE,
    cluster_columns = TRUE, show_row_dend = TRUE, show_column_dend = TRUE, col = heatmap_col,
    show_column_names = TRUE, show_row_names = FALSE, show_heatmap_legend = TRUE,
    top_annotation = treatment_col, column_title = "Differential Gene Expressions of Genes That Are Not Annotated",
    name = "row-normalized \ngene expression")
heatmap_no_anno

Similar to the previous heatmap (annotated but not enriched), this heatmap also clustered some distinction in gene expression patterns between the two groups, but each cluster is not very homogeneous. For example, there exists bands of blue (downregulated) genes in the red area (cluster of up-regulated genes).Although the distinction does exist, this heat map only contains 142 genes (significantly differentially expressed but not annotated). A solution to these dark matter genes is to use a larger set of pathways that could potentially include these genes, but the downside to this is that it could also introduce less-relevant pathways in our network result.


References

1. Xie Y. Knitr: A general-purpose package for dynamic report generation in r. 2023.
2. haozhu233. Haozhu233/kableextra: Construct complex table with knitr::kable() + pipe.
3. Quatrini L, Tumino N, Besi F, Ciancaglini C, Galaverna F, Grasso AG, et al. Glucocorticoids inhibit human hematopoietic stem cell differentiation toward a common ILC precursor. Journal of Allergy and Clinical Immunology. 2022;149:1772–85.
4. Quatrini L, Tumino N, Besi F, Ciancaglini C, Galaverna F, Grasso AG, et al. GSE186950: Glucocorticoids inhibit human hematopoietic stem cell differentiation toward a common ILC precursor. 2021.
5. Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the r/bioconductor package biomaRt. Nature Protocols. 2009;4:1184–91.
6. Robinson MD, McCarthy DJ, Smyth GK. edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2009;26:139–40.